Conflicting Information




How Do Vision-Language Models Process Conflicting Information Across Modalities?

Hua, Tianze, Yun, Tian, Pavlick, Ellie

arXiv.org Artificial Intelligence

AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption "A photo of a cat") and ask the model to report the information present in one of the specific modalities (e.g., "What does the caption say / What is in the image?"). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic "router heads" which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.
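The conflicting-input probe this abstract describes can be sketched in a few lines. The helper names and prompt wording below are illustrative assumptions, not the paper's code; a real experiment would pass actual image pixels to a vision-language model rather than a text placeholder.

```python
# Minimal sketch of the conflicting-input probe: pair an image of one
# class with a caption naming a different class, then query each
# modality separately.

def make_conflicting_pair(image_label, caption_label):
    """Build an image/caption pair whose two modalities disagree."""
    assert image_label != caption_label, "the pair must conflict"
    return {
        "image": f"<image of a {image_label}>",  # placeholder for real pixels
        "caption": f"A photo of a {caption_label}",
    }

def probe_prompts(pair):
    """Ask about each modality separately, mirroring the paper's setup."""
    context = f"{pair['image']}\nCaption: {pair['caption']}"
    return {
        "image": f"{context}\nWhat is in the image?",
        "caption": f"{context}\nWhat does the caption say?",
    }

pair = make_conflicting_pair("dog", "cat")
prompts = probe_prompts(pair)
```

A model that answers "dog" to both prompts favors the image modality; answering "cat" to both favors the caption. Only a model that answers each prompt from the requested modality resolves the conflict as instructed.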


Contradiction Detection in RAG Systems: Evaluating LLMs as Context Validators for Improved Information Consistency

Gokul, Vignesh, Tenneti, Srikanth, Nakkiran, Alwarappan

arXiv.org Artificial Intelligence

Retrieval Augmented Generation (RAG) systems have emerged as a powerful method for enhancing large language models (LLMs) with up-to-date information. However, the retrieval step in RAG can sometimes surface documents containing contradictory information, particularly in rapidly evolving domains such as news. These contradictions can significantly impact the performance of LLMs, leading to inconsistent or erroneous outputs. This study addresses this critical challenge in two ways. First, we present a novel data generation framework to simulate different types of contradictions that may occur in the retrieval stage of a RAG system. Second, we evaluate the robustness of different LLMs in performing as context validators, assessing their ability to detect contradictory information within retrieved document sets. Our experimental results reveal that context validation remains a challenging task even for state-of-the-art LLMs, with performance varying significantly across different types of contradictions. While larger models generally perform better at contradiction detection, the effectiveness of different prompting strategies varies across tasks and model architectures. We find that chain-of-thought prompting shows notable improvements for some models but may hinder performance in others, highlighting the complexity of the task and the need for more robust approaches to context validation in RAG systems.
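Using an LLM as a context validator amounts to showing it the retrieved set and asking it to flag contradictions. One plausible prompt shape is sketched below; the wording is an assumption, not the paper's template, and the resulting string would be sent to whatever LLM serves as the validator.

```python
def build_validator_prompt(question, docs):
    """One plausible prompt shape for an LLM acting as a context
    validator over a retrieved document set (wording is illustrative)."""
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return (
        f"Question: {question}\n"
        f"Retrieved passages:\n{numbered}\n"
        "Do any of these passages contradict each other? "
        "List the contradicting pair numbers, or answer 'none'."
    )

prompt = build_validator_prompt(
    "Who won the final?",
    ["Team A won the final.", "Team B won the final."],
)
```

The study's finding that chain-of-thought helps some models and hurts others suggests the instruction line above is exactly the part worth varying per model.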


A MapReduce Approach to Effectively Utilize Long Context Information in Retrieval Augmented Language Models

Zhang, Gongbo, Xu, Zihan, Jin, Qiao, Chen, Fangyi, Fang, Yilu, Liu, Yi, Rousseau, Justin F., Xu, Ziyang, Lu, Zhiyong, Weng, Chunhua, Peng, Yifan

arXiv.org Artificial Intelligence

While holding great promise for improving and facilitating healthcare, large language models (LLMs) struggle to produce up-to-date responses on evolving topics due to outdated knowledge or hallucination. Retrieval-augmented generation (RAG) is a pivotal innovation that improves the accuracy and relevance of LLM responses by integrating LLMs with a search engine and external sources of knowledge. However, the quality of RAG responses can be strongly affected by the rank and density of key information in the retrieval results, as in the "lost-in-the-middle" problem. In this work, we aim to improve the robustness and reliability of the RAG workflow in the medical domain. Specifically, we propose a map-reduce strategy, BriefContext, to combat the "lost-in-the-middle" issue without modifying the model weights. We demonstrate the advantage of the workflow with various LLM backbones and on multiple QA datasets. This method promises to improve the safety and reliability of LLMs deployed in healthcare domains.
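The map-reduce pattern behind a strategy like BriefContext can be sketched generically: answer the question against each small chunk of the retrieved context (map), then synthesize the partial answers (reduce), so no single prompt buries key information in the middle of a long context. The function and parameter names below are assumptions; `answer_fn` and `combine_fn` stand in for real LLM calls.

```python
def map_reduce_answer(question, docs, chunk_size, answer_fn, combine_fn):
    """Map: answer the question against each small chunk of documents.
    Reduce: synthesize the partial answers into one final answer.
    `answer_fn` and `combine_fn` are stand-ins for LLM calls."""
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    partials = [answer_fn(question, chunk) for chunk in chunks]  # map step
    return combine_fn(question, partials)                        # reduce step

# Toy stand-ins so the sketch runs without a model.
docs = [f"doc{i}" for i in range(7)]
result = map_reduce_answer(
    "q",
    docs,
    chunk_size=3,
    answer_fn=lambda q, chunk: f"partial:{len(chunk)}",
    combine_fn=lambda q, partials: " | ".join(partials),
)
```

Because each map call sees only `chunk_size` documents, relevant passages are never pushed deep into a long prompt, which is the mechanism by which a map-reduce workflow sidesteps the "lost-in-the-middle" effect.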


Open Domain Question Answering with Conflicting Contexts

Liu, Siyi, Ning, Qiang, Halder, Kishaloy, Xiao, Wei, Qi, Zheng, Htut, Phu Mon, Zhang, Yi, John, Neha Anna, Min, Bonan, Benajiba, Yassine, Roth, Dan

arXiv.org Artificial Intelligence

Open domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections of text often contain conflicting information, and indiscriminately depending on this information may result in untruthful and inaccurate answers. To understand the gravity of this problem, we collect a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), and find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search. We evaluate and benchmark three powerful Large Language Models (LLMs) on our dataset QACC and demonstrate their limitations in effectively addressing questions with conflicting information. To explore how humans reason through conflicting contexts, we ask our annotators to provide explanations for their selections of correct answers. We demonstrate that by finetuning LLMs to explain their answers, we can introduce richer information into their training that guides them through the process of reasoning with conflicting contexts.
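The explain-then-answer finetuning idea implies a training record that pairs the question and its conflicting contexts with the annotator's explanation followed by the answer. The record shape and field names below are illustrative assumptions, not QACC's published format.

```python
def make_explained_example(question, contexts, answer, explanation):
    """Assumed record shape for finetuning an LLM to justify its choice
    among conflicting contexts; field names are illustrative, not QACC's."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return {
        "input": f"Question: {question}\nContexts:\n{numbered}",
        # Emitting the explanation before the answer is what injects the
        # annotator's reasoning into the training signal.
        "target": f"Explanation: {explanation}\nAnswer: {answer}",
    }
```

Placing the explanation before the answer in the target means the model must generate its reasoning first at inference time as well, which is the mechanism the abstract credits for improved handling of conflicting contexts.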


Toward Robust RALMs: Revealing the Impact of Imperfect Retrieval on Retrieval-Augmented Language Models

Park, Seong-Il, Lee, Jay-Yoon

arXiv.org Artificial Intelligence

Retrieval Augmented Language Models (RALMs) have gained significant attention for their ability to generate accurate answers and improve efficiency. However, RALMs are inherently vulnerable to imperfect information due to their reliance on an imperfect retriever or knowledge source. We identify three common scenarios (unanswerable, adversarial, and conflicting) in which retrieved document sets can confuse RALMs, illustrated with plausible real-world examples. We present the first comprehensive investigation of how well RALMs detect and handle such problematic scenarios. Among these scenarios, to systematically examine adversarial robustness, we propose a new adversarial attack method, Generative model-based ADVersarial attack (GenADV), and a novel metric, Robustness under Additional Document (RAD). Our findings reveal that RALMs often fail to identify the unanswerability or contradiction of a document set, which frequently leads to hallucinations. Moreover, we show that adding an adversary significantly degrades RALMs' performance, with the models becoming even more vulnerable when two scenarios overlap (adversarial + unanswerable). Our research identifies critical areas for assessing and enhancing the robustness of RALMs, laying the foundation for the development of more robust models.
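The abstract names RAD (Robustness under Additional Document) without defining it; one plausible reading, sketched below purely for illustration, is the fraction of originally-correct answers that remain correct after the adversarial document is added. The paper's exact formula may differ.

```python
def rad(correct_before, correct_after):
    """One plausible reading of Robustness under Additional Document (RAD):
    the share of originally-correct answers that survive the addition of
    an adversarial document. Illustrative only; not the paper's formula."""
    kept = sum(1 for b, a in zip(correct_before, correct_after) if b and a)
    total = sum(correct_before)
    return kept / total if total else 0.0
```

Under this reading a RAD of 1.0 means the added adversarial document never flipped a correct answer, while lower values quantify how much the attack degrades the RALM.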


Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles

Huang, Kung-Hsiang, Laban, Philippe, Fabbri, Alexander R., Choubey, Prafulla Kumar, Joty, Shafiq, Xiong, Caiming, Wu, Chien-Sheng

arXiv.org Artificial Intelligence

Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, to our knowledge, the summarization of diverse information dispersed across multiple articles about an event has not been previously investigated. The latter imposes a different set of challenges for a summarization model. In this paper, we propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event. To facilitate this task, we outline a data collection schema for identifying diverse information and curate a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference. Moreover, we conduct a comprehensive analysis to pinpoint the position and verbosity biases when utilizing Large Language Model (LLM)-based metrics for evaluating the coverage and faithfulness of the summaries, as well as their correlation with human assessments. We apply our findings to study how LLMs summarize multiple news articles by analyzing which types of diverse information LLMs are capable of identifying. Our analyses suggest that despite the extraordinary capabilities of LLMs in single-document summarization, the proposed task remains a complex challenge for them, mainly due to their limited coverage, with GPT-4 able to cover less than 40% of the diverse information on average.


Study shows the brains of 'gritty' people work differently from people who are less perseverant

Daily Mail - Science & tech

When asked to think of someone really 'gritty', an image of US marshal Rooster Cogburn in the film 'True Grit' may spring to mind. Throughout the story, Cogburn demonstrates an unwavering determination to catch the criminals and fugitives he pursues, and to protect 14-year-old Mattie, who has hired him to track down her father's killer. Now a new study has found that the brains of 'gritty' people like Cogburn work differently from those of people who are less perseverant in pursuing their goals. It found that people who are more determined to achieve their long-term goals find it easier to consider all available information while remaining sensitive to new conflicting information. This may help them be more aware of conflicting goals in their everyday lives that could take them off-track from their longer-term ones.


How To Make AI Trustworthy - USC Viterbi

#artificialintelligence

One of the biggest impediments to adoption of new technologies is trust in AI. Now, a new tool developed by USC Viterbi Engineering researchers generates automatic indicators of whether data and predictions generated by AI algorithms are trustworthy. Their research paper, "There Is Hope After All: Quantifying Opinion and Trustworthiness in Neural Networks" by Mingxi Cheng, Shahin Nazarian and Paul Bogdan of the USC Cyber Physical Systems Group, was featured in Frontiers in Artificial Intelligence. Neural networks are a type of artificial intelligence that are modeled after the brain and generate predictions. But can the predictions these neural networks generate be trusted?